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Abstract 

We extend kernelized matrix factorization 
with a fully Bayesian treatment and with an 
ability to work with multiple side information 
sources expressed as different kernels. Kernel 
functions have been introduced to matrix fac- 
torization to integrate side information about 
the rows and columns (e.g., objects and users 
in recommender systems), which is necessary 
for making out-of-matrix (i.e., cold start) pre- 
dictions. We discuss specifically bipartite 
graph inference, where the output matrix is 
binary, but extensions to more general matri- 
ces are straightforward. We extend the state 
of the art in two key aspects: (i) A fully con- 
jugate probabilistic formulation of the kernel- 
ized matrix factorization problem enables an 
efficient variational approximation, whereas 
fully Bayesian treatments are not computa- 
tionally feasible in the earlier approaches, 
(ii) Multiple side information sources are in- 
cluded, treated as different kernels in multi- 
ple kernel learning that additionally reveals 
which side information sources are informa- 
tive. Our method outperforms alternatives in 
predicting drug-protein interactions on two 
data sets. 



1 Introduction 

Matrix factorization algori thms a r e very popular ma- 
trix completion methods ( Srebrol 2004), successfully 
used in many applications including recommender sys- 
tems and image inpainting. The main idea behind 



these methods is to factorize a partially observed data 
matrix by finding a low-dimensional latent representa- 
tion for both its rows and columns. The prediction for 
a missing entry can be calculated as the inner prod- 
uct between the latent repr esentations of the corre- 
spondin g row and column. ISalakhutdinov and Mnih 



( 2008a. b) give a probabilistic formulation for matrix 
factorization and its fully Bayesian extension. How- 
ever, these approaches are still incomplete in two ma- 
jor aspects: (i) It is not possible to integrate side infor- 
mation (e.g., social network or user profiles for recom- 
mender systems) into the model, (ii) It is not possible 
to make predictions for completely empty columns or 
rows (i.e., out-of-matrix prediction). 

Algorithms for integrating side information into ma- 
trix factorization have been pr oposed ea r lier in the 
recommender systems literature. Ma et al. (2008) pro- 
pose a probabilistic matrix factorization method that 
uses a social network and the rati ng matrix together 
to find better latent components. Shan and Banerieel 
( 2010f ) integrate side information into a probabilis- 
tic matrix factorization model using topic models to 
generate latent components of th e rated items (e.g., 
movies). lAgarwal and Chenl (|2010h use a similar strat- 
egy to generate latent co mponents of bot h user s and 
items using topic models. IWang and Bleil (|201ll ) also 
combine matrix factorization and topic models for sci- 
entific article recommendation using textual content 
of articles as side information. All these algorithms 
are based on explicit feature representations; some are 
specific to count (e.g., text) data, and all are able to 
model linear dependencies. We use kernels to include 
nonlinear dependencies. 

Recently, IZhou et al.l (|2012h propose a kernelized prob- 
abilistic matrix factorization method that generates 
latent components using Gaussian process priors with 
covariance matrices calculated on side information. 
However, with the modeling assumptions, only a max- 
imum a posteriori (MAP) estimate for the latent com- 
ponents is computationally feasible, and even then the 
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used gradient descent approach can be very slow. Fur- 
thermore, the method uses only a single kernel for each 
domain and needs test instances while training to be 
able to calculate their latent components (i.e., trans- 
ductive learning). 

In this paper, we focus on modeling interaction net- 
works between two domains (e.g., biological networks 
between drugs and proteins), and estimating un- 
known interactions between objects from these two 
doma ins, which i s also known as bipartite graph infer- 



ence ( Yamanishi . 20091) . The standard pairwise ker- 



nel approaches are based on a kernel matrix over ob- 
ject pairs in t he training set and are c omputation- 
ally expensive ( Ben-Hur and Noblei . 2005 ). There are 
also kernel-based (non-Bayesian) dimensionality re- 
duction algorithms that map objects from both do- 
mains into the same subspace and perform predictio n 
there (jYamanishil . 120091 lYamanishi et all . l2008l l2010h . 



In biological interaction networks, being composed of 
structured objects such as drugs and proteins, there 
exist several feature r epresentations or simila rity mea- 



sures for the objects (jScholkopf et all [2004). Instead 



of using a single specific kernel, we can combine mul- 
tiple kernel functions to obtain a better similarity 
measure, which is known a s multiple kernel learning 
(jGonen and Alpavdinl . 120111 ). 



We introduce a kernelized Bayesian matrix factoriza- 
tion method and give its details for the bipartite graph 
inference scenario; it can also be applied to other types 
of matrices with slight modifications. Our two main 
contributions are: (i) We formulate a novel fully con- 
jugate probabilistic model that allows us to develop 
an efficient variational approximation scheme, the first 
fully Bayesian treatment which is still significantly 
faster than the earlier method fo r computing MAP 



point estimates ( Zhou et al. . 20121) . (ii) The proposed 



method is able to integrate multiple side information 
sources by coupling matrix factorization with multi- 
ple kernel learning. We show the effectiveness of our 
approach on one toy data set and two drug-protein 
interaction data sets. 

2 Preliminaries and Notation 

We assume that the objects come from two domains X 
and Z. We are given two samples of independent and 
identically distributed training instances from each, 
denoted by X = {x, £ X}?^ and Z = {zj € Z}fl v 
In order to calculate similarities between the objects 
of the same domain, we have multiple kernel functions 
for each domain, namely, {fc x . m : X x X —> M} n ^_ 1 and 
{fc z .„ : Z x Z — > M.}^=x- If the side information comes 
in the form of features instead of similarities, the set 
of kernels defined for a specific domain correspond to 



different notions of similarity on the same feature rep- 
resentation or may be using information coming from 
multiple feature representations (i.e., views). 

The (i,j)th entry of the target label matrix Y 6 
{-!,+!}"■ 



N x xN z 



IS 



Vi = 



+1 if Xi and Zj are interacting, 
— 1 otherwise. 



The superscript indexes the rows and the subscript in- 
dexes the columns. The prediction task is to estimate 
unknown interactions for out-of-matrix objects, which 
is also known as cold start prediction in recommender 
systems. 

Figure [TJ illustrates the method we propose; it is com- 
posed of four main parts: (a) kernel-based nonlinear 
dimensionality reduction, (b) multiple kernel learning, 
(c) matrix factorization, and (d) binary classification. 
The first two kernel-based parts are applied to each 
domain separately and they are completely symmet- 
ric, hence we call them twins. One of the twins (i.e., 
the one that operates on domain Z) is omitted for clar- 
ity. In this section, we briefly explain each part and 
introduce the notation used. In the following sections, 
we formulate a fully conjugate probabilistic model and 
derive a variational approximation. 



predicted 
outputs 




input 
kernels 



kernel-specific 
components 



target labels 



Figure 1: Flowchart of kernelized matrix factorization 
with twin multiple kernel learning (the twin domain Z 
is omitted for clarity). 

2.1 Kernel-Based Nonlinear Dimensionality 
Reduction 

In this part, we perform feature extraction using the 
input kernel matrices {K xm € 



pN K xN x \P, 



Yr, 



and the 

common projection matrix A x £ R N * xR where R is the 
corresponding subspace dimensionality. We obtain the 
kernel-specific components {G x . m = AjK x m }^ =1 af- 
ter the projection. The main idea is very similar to 
kernel principal component analysis or kernel Fisher 
discriminant analysis, where the columns of the pro- 
j ection matrix can be solved with eigendecompositions 
(jScholkopf and Smo lal. l2002h . However, this solution 
strategy is not possible for the more complex model 
formulated here. 
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2.2 Multiple Kernel Learning 

This part is responsible of combining the kernel- 
specific components linearly to obtain the composite 



Px 



i e x ,mG Xjm where the kernel 



components H x = 
weights can take arbitrary values e x S M.^ 1 . If we have 
a single kernel function for a specific domain, we can 
safely ignore the composite components and the ker- 
nel weights, and use the single available kernel-specific 
components to represent the objects in that domain. 
The details of our method with a single kernel function 
for each domain are explained as the supplementary 
material in Appendix [Al 

2.3 Matrix Factorization 

In this part, we propose to use the low-dimensional 
representations of objects in the unified subspace, 
namely, H x and H z , to calculate the predicted output 
matrix F = HjH 2 . This corresponds to factorizing 
the predicted outputs into two low-rank matrices. 

2.4 Binary Classification 

This part just assigns a class label to each object pair 
(xi , Zj ) by looking at the sign of the predicted out- 
put /j in the matrix factorization part. The proposed 
method can also be extended to handle other types 
of outputs (e.g., real- valued outputs used in recom- 
mender systems) by removing the binary classification 
part and directly generating the target outputs in the 
matrix factorization part. This corresponds to remov- 
ing the predicted output matrix F and generating tar- 
get label matrix Y directly from the composite com- 
ponents H x and H 2 . The details of our method for 
real-valued outputs are also given as the supplemen- 
tary material in Appendix [Bj 

3 Kernelized Bayesian Matrix 

Factorization with Twin Multiple 
Kernel Learning 



with latent variables and their corresponding priors. 
There are some additions to the notation described 
earlier: The N x x R matrix of priors for the entries of 
the projection matrix A x is denoted by A x . The P x x 1 
vector of priors for the kernel weights e x is denoted by 
T7 X . The standard deviations for the kernel-specific and 
composite components are represented as a g and ah, 
respectively; these hyper-parameters are not shown for 
clarity. 




A x * 



Figure 2: Graphical model of kernelized Bayesian ma- 
trix factorization with twin multiple kernel learning. 

The distributional assumptions of the dimensionality 
reduction part are 

K <s ~9(\i, s ;a x ,j3 x ) V(i,a) 
^JA^-A^O,^)- 1 ) V(i,«) 

9x,m,i l a x,s, fejc,ro,i ~ ■^'{9x,m,i'l a x,s^^-,m,iy 0~ g ) V(m, S, l) 

where A"(-; /x, X) is the normal distribution with mean 
vector fi and covariance matrix S, and Q(-;a,/3) de- 
notes the gamma distribution with shape parameter a 
and scale parameter /?. The multiple kernel learning 
part has the following distributional assumptions: 



/ Px 



^ X ,il{ e x,TO,5 x ,m,i}m=l ~ AH ^x,i) Cx ' 



™5 x ,m,ii °h 



Vm 
Vto 

V(M) 



For the method described in the previous section, 
we formulate a probabilistic model, called kernelized 
Bayesian matrix factorization with twin multiple ker- 
nel learning (KBMF2MKL) , which has two key prop- 
erties that enable us to perform efficient inference: 
(i) The kernel-specific and composite components are 
modeled explicitly by introducing them as latent vari- 
ables, (ii) Kernel weights are assumed to be nor- 
mally distributed without enforcing any constraints 
(e.g., non- negativity) on them. The reasons for intro- 
ducing these two properties to our probabilistic model 
becomes clear when we explain our inference method. 

Figure H gives the graphical model of KBMF2MKL 



where kernel-level sparsity can be tuned by changing 
the hyper-parameters (a^, /3 n ). Setting the gamma pri- 
ors to induce sparsity, e.g., (a^,/3 ?) ) = (0.001,1000), 
produces results analogous to using the ^i-norm on 
the kernel weights, whereas using uninformative priors, 
e.g., (a,,,/?,,) = (1,1), resembles using the ^2-norm. 
The matrix factorization part calculates the predicted 
outputs using the inner products between the low- 
dimensional representations of the object pairs: 

where the predicted outputs are introduced to spee d 
up the inference procedures ([Albert and Chibl . Il993l ) . 



Kernelized Bayesian Matrix Factorization 



The binary classification part forces the predicted out- 
puts to have the same sign with their target labels: 

y\\fi ~ stfy} > v) v(i,j) 

where the margin parameter v is introduced to remove 
ambiguity in the scaling and to place a low-density re- 
gion between two classes, similar to the margin idea 
in SVMs, which is generally used for semi-supervised 
learning (jLawrence and Jordan . 120051 ). Here, S(-) is 
the Kronecker delta function that returns 1 if its ar- 
gument is true and otherwise. 

4 Efficient Inference Using Variational 
Approximation 

Exact inference for the model is intractable and of 
the two readily available alternatives, Gibbs sampling 
and variational approximation, we choose the latter 
for computational efficiency Variational methods op- 
timize a lower bound on the marginal likelihood, which 
involves a factorized approximation of the posterio r, to 
find the joint parameter distribution (jBeall . 12003). 



As short-hand notations, all hyper-parameters in the 
model are denoted by C — { a n, P-q, &X, Px, erg, <Jh, v}, 
all prior variables by S = {r7 x , r) z , A x , A z }, 
and the remaining random variables by © = 

{A x , A z , e x , e z , F, \Gx,m}m=li {Gz,n} n =i, H x , H 2 }. 

We omit the dependence on £ for clarity. We factorize 
the variational approximation as 

p(©,S|{K x , m }^ =1) {K Zi XL 1 ,Y)«g(0,H) = 
g(A x )g(A x )g({G x>m }^ =1 )g(r7 x )g(e x )g(H x ) 
g(A z )g(A z )g({G z>n }^ 1 )g(7 7z )g(e z )g(H z ) g (F) 

and define each factor according to its full conditional: 

N s R 

?(Ax)=nn^^^)-^t)) 

i=l s=l 



g(A x ) = JjA/"(a x , s ;/i(a x , s ),S(a x , s )) 



s=l 



g(G x , m ) = Y[J^(g x>m ,i, K9^ m ,i),^(9y,, m ,i)) 
g(e x ) = J\f(e x ; fi(e x ), £(e x )) 

<l(V*) = II £(f?x,m;a(?7x,m),/3(?7x.m)) 
m=l 
A 1 * 

g(H x ) = n^( h x, i ;M(ftx,i),S(h x , i )) 



Vrn 



2=1 



»=lj=l 



where a(-), /?(•), /x(-), and £(•) denote the shape pa- 
rameter, scale parameter, mean vector, and covariance 
matrix, respectively. Here, TM{-\ fi, S, p(-)) denotes 
the truncated normal distribution with mean vector 
/it, covariance matrix S, and truncation rule p(-) such 
that TAf(-; /x, X, p(-)) oc A/"(-;M, X) if p(-) is true and 
TM(-;ti,T,,p{-)) = otherwise. 

We can bound the marginal likelihood using Jensen's 
inequality: 

logpCYKK,,,,}^, 

E g(e> a) pogp(Y, 0, S|{K x , m }^ =1) {K^}^)] 

-E, (e ,B)pogg(e,S)] (1) 

and optimize this bound by maximizing with respect 
to each factor separately until convergence. The ap- 
proximate posterior distribution of a specific factor r 
can be found as 

q(r) oc 

exp(E g({ © ia}XT) [logp(Y,0,a|{K x , ro }^ =1 ,{K z , n }^ =1 )]). 

For our model, thanks to the conjugacy, the resulting 
approximate posterior distribution of each factor fol- 
lows the same distribution as the corresponding factor. 

The approximate posterior distributions of the dimen- 
sionality reduction part can be found as 



Nx R 



i=\ s=l 
R 




<7(A X 



n^iox,,;s(a x ,,)i: K,,m ^ ,m) 



s=l 



diag(A* 



x K K 

m=l 9 



(2) 



where the tilde notation denotes the posterior expec- 
tations as usual, i.e., /(r) = E g ( T J/(r)]. 

The kernel-specific components have the following ap- 
proximate posterior distribution: 



g(G x , ro ) = U^(g^ m xMg, 



erf, 



+ 



e 2 T 



Vm (3) 
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where the mean and covariance parameters are af- 
fected by the kernel weights, the composite compo- 
nents, and other kernel-specific components in addi- 
tion to the projection matrix and the corresponding 
kernel matrix. 

The approximate posterior distributions of the multi- 
ple kernel learning part can be found as 



q{e x ) =Af\ e x ;S(e x ) 



diag(?7x) 



C I TT 

VJ, x,m J - l x 



GT,,,, Gx,o 



m— 1 



(4) 



where the mean and covariance parameters of the ker- 
nel weights are calculated using the kernel-specific and 
composite components. 

The composite components have the following approx- 
imate posterior distribution: 



g(H x ) 



i=l 



m=l h 



H 2 (/ 




(5) 



where it can be seen that the inference transfers in- 
formation between the two domains. Note that the 
composite components of each domain are the only 
random variables that have an effect on the other do- 
main, i.e., only the H z variables of domain Z are used 
when updating the random variables of the domain X . 

The approximate posterior distribution of the pre- 
dicted outputs is a product of truncated normals: 

,/:K: ini 7A ' : /; :/t - / '^- 1 -/:^ (6) 

We need to find the posterior expectation of F to up- 
date the approximate posterior distributions of the 
composite components. Fortunately, the truncated 
normal distribution has a closed-form formula for its 
expectation. 

4.1 Modeling Choices 

Note that using the kernel-specific and composite com- 
ponents as latent variables in our probabilistic model 



introduces extra conditional independencies between 
the random variables and enables us to derive efficient 
update rules for the approximate posterior distribu- 
tions. The other key property of our model is the as- 
sumption of normality of the kernel weights, which al- 
lows us t o obta in a fully conjugate probabilistic model 
( Gonenl . |2012| ) . In earlier Bayesian multiple kernel 
learning algorithms, the combined kernel has usually 
been defined as a convex sum of the input kernels, by 
assuming a Dirichlet d i stribution on the kernel weights 
( Girolami and Rogers . 2005 : Damoulas and Girolamil . 
2008). As a consequence of the nonconjugacy between 
Dirichlet and normal distributions, they need to use a 
sampling strategy (e.g., importance sampling) to up- 
date the kernel weights when deriving variational ap- 
proximations. 

4.2 Convergence 

The inference mechanism sequentially updates the ap- 
proximate posterior distributions of the latent vari- 
ables and the corresponding priors until convergence, 
which can be checked by monitoring the lower bound 
in (fT|) . The first term of the lower bound corresponds 
to the sum of exponential-form expectations of the dis- 
tributions in the joint likelihood. The second term is 
the sum of negative entropies of the approximate pos- 
teriors in the ensemble. The only nonstandard distri- 
bution in these terms is the truncated normal distribu- 
tion used for the predicted outputs, and the truncated 
normal distribution has a closed-form formula also for 
its entropy. 

4.3 Computational Complexity 

The most time-consuming operations of the update 
equations are covariance calculations because they 
need matrix inversions. The time complexity of 
the covariance updates for the projection matri- 
ces in @ is 0(R max(A 3 , A 3 )) and we can cache 

E^i K *,™ K Zm and J2n=i K z,« K zU before starting 
inference procedure to reduce the computational cost 
of these updates. The covariance updates for the 
kernel-specific components in ([3]) require inverting di- 
agonal matrices. The time complexities of the covari- 
ance updates for the kernel weights and the compos- 
ite components in (0} and §5$ are 0(max(P 3 , P 3 )). 
The other calculations in these updates can be done 
efficiently using matrix-matrix or matrix-vector mul- 
tiplications. Finding the posterior expectations of 
the predicted outputs in ([6]) only requires evaluat- 
ing the standardized normal cumulative distribution 
function and the standardized normal probability den- 
sity. In summary, the total time complexity of each 
iteration in our variational approximation scheme is 
C>(Pmax(A 3 , A 3 ) + max(P 3 ,P 3 )), which makes the 
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algorithm m ore efficient than standard p airwise kernel 
approaches (|Ben-Hur and Nobld . 120051) that require 
calculating a kernel matrix over pairs and training a 
kernel-based classifier using this kernel, resulting in 
O(N^N^) complexity. 

4.4 Prediction 

Given a test pair (cc*, we want to predict the cor- 
responding score /* or target label y*. We first re- 
place posterior distributions of A x , A z , e x , and e z 
with their approximate posterior distributions g(A x ), 
q(A z ), g(e x ), and q(e z ). Using the approximate distri- 
butions, we obtain the predictive distributions of the 
kernel-specific and composite components. The pre- 
dictive distribution of the target label can finally be 
formulated as 



Kit) - y 

£(/:) 



where Z* is the normalization coefficient calculated 
for the test pair and $(•) is the standardized normal 
cumulative distribution function. 

5 Experiments 

We first run our method on a toy data set to illustrate 
its kernel learning capability. We then test its perfor- 
mance in a real-life application with experiments on 
two drug-protein interaction data sets. One of them 
is a standard data set with a single view for each do- 
main and the other one is a larger multiview data set 
we have collected. Our Matlab implementations will 
be available at |http : //research . ics . aalto . f i/mi/ 
software /kbmf 2mkT/| 

5.1 Toy Data Set 

We create a toy data set consisting of samples from 
two domains and real- valued outputs for object pairs. 
The data generation process is: 



'-V(z?;0,l) 



X A Z A I *JC A Z 



+ T 7 Z W 



1) 



V(m, i) 
V(n,j) 
V(*,j) 



where (7V X , N z ) = (40, 60), the samples from X and Z 
are generated from 15- and 10-dimensional isotropic 
normal distributions with unit variance (i.e., m 6 
{1, . . . , 15} and n 6 {1, . . . , 10}), respectively, and the 
target outputs are generated using only three features 
from each domain. Note that this data set has not 
been generated from our probabilistic model. 



In order to have multiple kernel functions for each do- 
main, we calculate a separate linear kernel for each 
feature of the data points, i.e., (P X ,P Z ) = (15,10). 
We then learn our model, intended to work as a pre- 
dictive model that identifies the relevant features for 
the prediction task and has a good generalization per- 
formance. We use uninformative priors for the pro- 
jection matrices and the kernel weights by setting 
(a n , f3 n , a\, — (1,1,1,1). The standard deviations 
are set to (o- g ,ah,o- y ) — (0.1,0.1,1). The subspace 
dimensionality is arbitrarily set to five (i.e., R = 5). 

Figure [3] shows the found posterior means of the ker- 
nel weights. We see that our method correctly iden- 
tifies the relevant features for each domain (i.e., the 
first, fourth, and seventh features for X and the third, 
eighth, and tenth features for Z). The root mean 
square error between the target and predicted outputs 
is 0.9865 in accordance with the level of noise added. 
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Figure 3: Posterior means of the kernel weights found 
by our method on the toy data set. 



5.2 Drug-Protein Data Se t of 



Yamanishi et al.l (|2008) 



As the first case study, we analyze a drug-protein 
interaction network of humans, involving enzymes in 
particular. This drug-protein interaction network con- 
tains 445 drugs, 664 proteins, and 2926 experimen- 
tally validated interactions between them. The data 
set consists of the chemical similarity matrix between 
drugs, the genomic similarity matrix between proteins, 
an d the target matr i x of k nown interactions provided 
bv lYamanishi et al.l (|2008l ). 



We compare one baseline and three matrix factoriza- 
tion methods: (i) Baseline simply calculates the tar- 
get output averages over each column or row as the 
predictions, (ii) kernelize d probabilistic ma trix factor- 
ization (KPMF) method of IZhou et aD (|2012h with real- 
valued outputs, (iii) our kernelized probabilistic matrix 



Mehmet Gonen, Suleiman A. Khan, Samuel Kaski 



factorization (KBMF) method with real-valued outputs, 
and (iv) KBMF method with binary outputs. 

Our experimental methodology is as follows: For KPMF, 
the standard deviation a y is set to one. For both KBMF 
variants, we use uninformativc priors for the projection 
matrices and the kernel weights, i.e., (a n , /3 n , a\,j3\) = 
(1, 1, 1, 1), and the standard deviations (a g , ah) are set 
to (0.1,0.1). For KBMF with real-valued outputs, the 
standard deviation a y is set to one. For KBMF with 
binary outputs, the margin parameter v is arbitrarily 
set to one. We perform simulations with eight different 
numbers of components, i.e., R £ {5, 10, . . . , 40}. We 
run five replications of five-fold cross validation over 
drugs and report the average area under ROC curve 
(AUC) over the 25 results as the performance measure. 

In the results, C and G mark the chemical similarity be- 
tween drugs and the genomic similarity between pro- 
teins, respectively, whereas N marks the similarity be- 
tween proteins calculated from the interaction network 
and it is defined as the ratio between (i) the number 
of drugs that are interacting with both proteins and 
(ii) the number of drugs that are interacting with at 
least one of the proteins, (i.e., Jaccard index). 

The results in Figure H] reveal that KPMF is above the 
baseline for more than 5 components, and both vari- 
ants of KBMF for all component numbers. Both vari- 
ants of our new KBMF outperform the earlier KPMF for 
all types of inputs. The difference is not due to KPMF 
having been introduced only for real-valued outputs, 
as even the real-output variant of KBMF is better. The 
difference is not due to the inability of the current ver- 
sion of KPMF to handle multiple data views either, as 
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Figure 4: Average prediction performances (area un- 
d er ROC curve, AUC) o n the drug-protein data set 
of lYamanishi et al.l (|2008l ). Gray solid line: Baseline; 
other solid lines: KBMF with binary outputs; dashed 
lines: KBMF; dash-dotted lines: KPMF. 



the single-kernel KBMF outperforms it. Hence the dif- 
ferences in the performance arc due to the differences 
in the inference: MAP point estimates versus fully 
Bayesian inference. The best results are obtained with 
the binary-output KBMF when using all data sources. 

Note that when we combine the genomic and network 
similarities between proteins using our method, the re- 
sulting similarity measure for proteins is better than 
those of single-kernel scenarios, leading to better pre- 
diction performance. This shows that when we have 
multiple side information sources about the objects, 
integrating them into the matrix factorization model 
in a principled way improves the results. 



5.3 Drug— Protein Data Set of Kh an et al. 
(|2012h 



We study an additional drug-protein interaction net- 
work of humans, containing 855 drugs, 800 proteins, 
and 4659 experimentally validated interactions be- 
tween them, extracted from the drugs a nd pr oteins 
of the data set provided by iKhan et al. ( 2012 ) . We 
have two views consisting of two standard 3D chem- 
ical structure descript ors for drugs , nam ely, 1120- 



dimensional Amand a dDuran et al 



2008) and 76- 
2000). In order 



dimensional VolSurf ()Cruciani et al 
to calculate the similarity between drugs, we use a 
Gaussian kernel whose width parameter is selected 
as the square root of the dimensionality of the data 
points. 

We repeat the same experimental procedure as in the 
previous experiment with one minor change only. We 
perform simulations with 16 different numbers of com- 
ponents, i.e., R £ {5, 10, ... , 80}, due to the larger size 
of the interaction network. 

We compare four different ways of including the drug 
property data. Amanda and VolSurf correspond to 
using a single view when calculating the kernel be- 
tween drugs. Early corresponds to concatenating 
the two views, which is known as early combination 



(jScholkopf et all 120041) , before calculating the kernel 
between drugs. MKL corresponds to calculating two dif- 
ferent kernels between drugs and combining them with 
our kernel combination approach. 

The overall ordering of the results of the different ma- 
trix factorization methods is the same as in the pre- 
vious case study (Figure [5]). The KPMF outperforms 
Baseline after 20 components, whereas KBMF is con- 
sistently and significantly better (by at least four per- 
centage units) than KPMF for all single-kernel scenar- 
ios. KBMF with five components is already better than 
Baseline for all scenarios. We do not report the re- 
sults of KBMF with real- valued outputs in order not to 
clutter the figure. 
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Figure 5: Average prediction performances (area un- 
der ROC cu r ve, A UC) on the drug-protein data set of 



Khan et al. (|2012h . Gray solid line: Baseline; solid 
lines: KBMF with binary outputs; dash-dotted lines 
KPMF. 



For KBMF with binary outputs, we see that Amanda and 
VolSurf are significantly better than Baseline and 
obtain similar prediction performances. Early out- 
performs Amanda and VolSurf with a slight margin, 
whereas MKL obtains consistently better results than 
all the other scenarios after five components. 

Our method can also be interpreted as a metric learn- 
ing algorithm since each kernel function can be con- 
verted into a distance metric. We test this property 
on the task of finding or retrieving drugs with similar 
functions. The idea is that since the targets are cen- 
trally important for the action mechanisms of drugs, 
a metric useful for predicting targets could be useful 
for retrieval of drugs as well. As the ground truth 
for relevance we use a standard therapeutic classifi- 
cation of the drugs according to the organ or system 
on which they act and/or their chemical characteristics 
(not used during learning); drugs having the same class 
are considered relevant. Figure [5] gives the precision at 
top-fc retrieved drugs, when each drug in turn is used 
as the query and the rest of the 855 drugs are retrieved 
in the order of similarity according to the learned met- 
ric. Early is better than Amanda and VolSurf, and 
MKL is the best. This shows that our method is able 
to learn a kernel function between drugs that is bet- 
ter for retrieval than the kernels either on single or 
concatenated views. 



6 Discussion 

We introduce a kernelized Bayesian matrix factoriza- 
tion method that can make use of multiple side in- 
formation sources about the objects (both rows and 
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Figure 6: Average precision (over query drugs) of re- 
trieval as a function of number k of retrieved drugs. 
Sec the text for details. 



columns) and be applied in various scenarios includ- 
ing interaction network modeling and recommender 
systems. Our two main contributions are: (i) for- 
mulating an efficient variational approximation scheme 
for inference with the help of a novel fully conjugate 
probabilistic model and (ii) coupling matrix factoriza- 
tion with multiple kernel learning to integrate multiple 
side information sources into the model. In contrast 
to the earlier k ernelized probabilis tic matrix factoriza- 
tion method of IZhou et al.l (|2012l ). for our probabilis- 
tic model, it is possible to derive a computationally 
feasible fully Bayesian treatment. We illustrate the 
usefulness of the method on one toy data set and two 
molecular biological data sets. 



An interesting topic for future research is to optimize 
the dimensionality of the latent components using a 
Bayesian model selection procedure. For example, we 
can share the same set of precision priors for the pro- 
jection matrices and determine the dim ensionality us- 
ing automatic relevance determination (lNeaill99fih . 
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Supplementary Material 

In this supplementary material, we provide details about two variants of our method: (A) for a single kernel 
function on each domain instead of the multiple kernels, and (B) for real-valued outputs instead of the binary 
outputs. 

A Kernelized Bayesian Matrix Factorization with Twin Kernels 

We formulate a simplified probabilistic model, called kernelized Bayesian matrix factorization with twin kernels 
(KBMF2K), for the case with a single kernel function for each domain. Figure [7] shows the graphical model of 
KBMF2K with latent variables and their corresponding priors. 




Figure 7: Graphical model of kernelized Bayesian matrix factorization with twin kernels. 
The distributional assumptions of the simplified model are 

>i.~S(>i.;<*x,P\) V(M) 
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As short-hand notations, all hyper-parameters in the model are denoted by £ = {a\, (3\,a g ,v\, all prior variables 
by S = {A X ,A Z }, and the remaining random variables by = {A x , A z , F, G x , G z }. We again omit the 
dependence on £ for clarity. We can write the factorized variational approximation as 

p(0, E|K X , K z , Y) E) = q(A x )q(A x )q(G x )q(A z )q(A z )q(G z )q(F) 

and define each factor in the ensemble just like its full conditional: 
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We can bound the marginal likelihood using Jensen's inequality: 

logp(Y|K x ,K z ) >E g(0!S) [logp(Y,©,H|K x ,K z )] -E 9(e ,a)pog«(©,B)] 

and optimize this bound by maximizing with respect to each factor separately until convergence. The approximate 
posterior distribution of a specific factor r can be found as 

q(r) oc exp(E g({0>H}Vr) [logp(Y,0,3|K x ,K z )]). 
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The approximate posterior distributions of the ensemble can be found as 
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B Kernelized Bayesian Matrix Factorization with Twin Multiple Kernel 
Learning for Real- Valued Outputs 

We modify our proposed model for binary- valued outputs to also handle real- valued outputs. Figure [5] illustrates 
the graphical model of the modified kernelized Bayesian matrix factorization with twin multiple kernel learning 
(KBMF2MKL) with latent variables and their corresponding priors. 



A x > 




Figure 8: Graphical model of kernelized Bayesian matrix factorization with twin multiple kernel learning for 
real- valued outputs. 



The distributional assumptions of the modified KBMF2MKL model are 
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As short-hand notations, all hyper-parameters in the model are denoted by C — Ayi c<a, /?a, ct s , <t/j, o~ y }, 
all prior variables by S = {t7 x , T7 z , A x , A z }, and the remaining random variables by = 
{A x , A z , e x , e z , {Gx, m }m = i, {Gz^j^j, H x , H z }. We again omit the dependence on £ for clarity. We can write 
the factorized variational approximation as 
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and define each factor in the ensemble just like its full conditional: 
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We can bound the marginal likelihood using Jensen's inequality: 

logp(Y|{K x , m }£ =1 ,{K z ,X=i) > E, ( e, S )[logp(Y,0,S|{K x , m }^ =1 ,{K 2 , n }^L 1 )] - E, (e , S )[log g (0,S)] 

and optimize this bound by maximizing with respect to each factor separately until convergence. The approximate 
posterior distribution of a specific factor r can be found as 

g(T)«exp(E, ({ e,B}\r)pogp(Y,e,S|{K Iim }^ =1 ,{K, in }^Li)]). 
The approximate posterior distributions of the ensemble can be found as 
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